
Python Data Cleaning for Web Scraping: JSON, MongoDB, and Regex Techniques

Master Python data cleaning for web scraping: JSON parsing, MongoDB storage, JSONP handling with demjson, and regex extraction. Learn to process Star Wars API data, deduplicate MongoDB entries, and extract Quora follower counts efficiently. Boost your scraping skills with these pro methods.

2025-09-17

This article continues from 10 Essential Python Data Cleaning Techniques for Web Scraping, focusing on practical data cleaning methods that modern scraping projects rely on daily.

Most real-world scraping tasks involve APIs, JavaScript-rendered data, and unstructured text. Therefore, mastering Python data cleaning for web scraping is essential for building stable and scalable crawlers.


3. JSON Data Cleaning

Today, most websites expose data through APIs, and JSON has become the dominant response format. As a result, Python developers must handle JSON efficiently.

Example: Star Wars API (SWAPI)

API endpoint:

https://swapi.dev/api/people/

import requests

url = "https://swapi.dev/api/people/"
# verify=False skips TLS certificate checks; use it only when the
# endpoint's certificate is known to be broken, never in production.
response = requests.get(url, verify=False)
json_data = response.json()

print(json_data["results"])  # Access character data
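SWAPI returns numeric fields such as height and mass as strings, sometimes "unknown" and sometimes with thousands separators, so a typical cleaning pass normalizes them before storage. A minimal sketch (the to_int helper and field names height_cm / mass_kg are illustrative, not part of the API):

# Normalize SWAPI's string-typed numeric fields before storage.
def to_int(value):
    try:
        return int(value.replace(",", ""))
    except (ValueError, AttributeError):
        return None  # "unknown" and other non-numeric values

people = [
    {
        "name": person["name"],
        "height_cm": to_int(person["height"]),
        "mass_kg": to_int(person["mass"]),
    }
    for person in json_data["results"]
]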

Handling Non-English Characters

When JSON contains non-English text, such as Chinese characters, you should explicitly set the encoding:

response.encoding = "utf8"
json_data = response.json()

This step prevents garbled text and ensures accurate downstream processing.
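If you are unsure which charset the server actually used, requests can also guess one from the response body. A small sketch, assuming the server's declared charset may be missing or wrong:

# Fall back to requests' content-based guess when the declared
# charset is absent or the unreliable ISO-8859-1 default.
if not response.encoding or response.encoding.lower() == "iso-8859-1":
    response.encoding = response.apparent_encoding
json_data = response.json()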


4. Storing JSON Data in MongoDB (NoSQL)

When JSON structures become deeply nested, traditional SQL databases introduce unnecessary complexity. In contrast, MongoDB handles nested documents naturally, making it a strong choice for Python data cleaning for web scraping.

Installation

pip install pymongo

Insert JSON Data into MongoDB

import pymongo
from pymongo.errors import BulkWriteError

# Fill in your own connection details.
user, password, host, port = "user", "password", "localhost", 27017

client = pymongo.MongoClient(
    f"mongodb://{user}:{password}@{host}:{port}"
)
db = client["db_spider"]
collection = db["wars_star"]

# A unique index rejects duplicate names on re-runs
collection.create_index("name", unique=True)

# ordered=False keeps inserting past duplicates, but they still raise
# a BulkWriteError at the end; catch it so re-runs don't crash
try:
    collection.insert_many(json_data["results"], ordered=False)
except BulkWriteError:
    pass  # duplicates were skipped, new documents were inserted

Query Examples (MongoDB shell)

Find characters whose names contain “Le”:

db.getCollection("wars_star").find({ name: /Le/ })

Find characters appearing in a specific film:

db.getCollection("wars_star").find({
  films: { $in: ["https://swapi.dev/api/films/1/"] }
})
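The same lookups work from Python through pymongo's query operators. A sketch against the collection created above:

# Names containing "Le" (equivalent to the shell regex /Le/)
for doc in collection.find({"name": {"$regex": "Le"}}):
    print(doc["name"])

# Matching a scalar against the "films" array finds documents
# whose array contains that value
film_url = "https://swapi.dev/api/films/1/"
print(collection.count_documents({"films": film_url}))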

Because MongoDB supports flexible schemas, it simplifies storage and querying of API responses with variable fields.


5. Handling JavaScript Object Data (JSONP)

Some websites return data wrapped inside JavaScript objects rather than pure JSON. Financial websites often use this pattern.

Example: Parsing JavaScript Object Data

import demjson  # on Python 3.7+, the maintained fork is demjson3

# `response` is a requests response whose body looks like
# `var someVar = { ... };` rather than pure JSON.

# Slice out the object literal between the "=" and the trailing ";"
js_data = response.text[
    response.text.find("=") + 2 : response.text.rfind(";")
]

# demjson tolerates JavaScript syntax (unquoted keys, single quotes)
# that the standard json module rejects
raw_data = demjson.decode(js_data)

# here the payload's "datas" field holds comma-separated records
rank_list = [item.split(",") for item in raw_data["datas"]]

This approach allows you to convert JavaScript-style data into structured Python objects without browser automation.
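To see the whole round trip without hitting a live endpoint, here is a self-contained sketch with an invented JSONP-style payload (the variable name rankData and the field values are made up for illustration):

import demjson

# Invented payload mimicking a JS-wrapped response; real endpoints differ
jsonp_text = 'var rankData = {datas:["1,AAPL,182.50","2,MSFT,411.22"]};'

js_data = jsonp_text[jsonp_text.find("=") + 2 : jsonp_text.rfind(";")]
raw_data = demjson.decode(js_data)  # handles the unquoted `datas` key

rank_list = [item.split(",") for item in raw_data["datas"]]
print(rank_list)  # [['1', 'AAPL', '182.50'], ['2', 'MSFT', '411.22']]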


6. Regular Expressions: The Universal Tool

Even with structured APIs, some data only appears inside raw HTML or text. In such cases, regular expressions provide a reliable fallback.

Single Match with re.search

import re

html = '<div class="q-text">6,526 followers</div>'
match = re.search(r">(.*?) followers<", html)

followers = int(match.group(1).replace(",", ""))
print(followers)  # 6526

Multiple Matches with re.findall

text = "Phone numbers: 18767543212 and 19767443218"
phones = re.findall(r"\d{11}", text)

print(phones)
# ['18767543212', '19767443218']
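One caveat: \d{11} also matches 11-digit runs embedded inside longer numbers. Lookarounds make the match exact, as in this sketch:

# Reject matches embedded in longer digit runs
phones = re.findall(r"(?<!\d)\d{11}(?!\d)", text)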

Regular expressions remain indispensable when APIs are unavailable or page structures change frequently.
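When the same pattern runs across thousands of pages, compiling it once avoids repeated parsing and makes the missing-match case explicit. A sketch reusing the followers pattern from above (the FOLLOWERS_RE name and extract_followers helper are illustrative):

import re

# Compile once, reuse across pages
FOLLOWERS_RE = re.compile(r">([\d,]+) followers<")

def extract_followers(page_html):
    match = FOLLOWERS_RE.search(page_html)
    return int(match.group(1).replace(",", "")) if match else None

print(extract_followers('<div class="q-text">6,526 followers</div>'))  # 6526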


When to Use Each Technique

Scenario                       Recommended Method
API responses                  JSON parsing
Nested or flexible schemas     MongoDB
JavaScript-returned objects    JSONP + demjson
Unstructured HTML/text         Regular expressions

In practice, effective Python data cleaning for web scraping combines multiple techniques rather than relying on a single solution.


Conclusion

In this chapter, you learned how to clean and process scraped data using JSON parsing, MongoDB storage, JavaScript object handling, and regular expressions. These techniques cover the majority of real-world scraping scenarios and integrate smoothly with larger crawling pipelines.

For more on extracting raw HTML data before cleaning, see:

Crawling HTML Pages: Python Web Scraping Tutorial

https://www.2808proxy.com/practical-application-of-crawler

In the next installment, we will explore more advanced data cleaning strategies that further improve crawler efficiency and data quality.